10 research outputs found

ECAPA-TDNN Embeddings for Speaker Diarization

Author: Dawalatabad Nauman
Desplanques Brecht
Grondin François
Na Hwidong
Ravanelli Mirco
Thienpondt Jenthe
Publication date: 01/01/2021

Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker-discriminative characteristics, and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in the speaker verification domain, thanks to a carefully designed neural model. In this work, we extend, for the first time, the use of the ECAPA-TDNN model to speaker diarization. Moreover, we improved its robustness with a powerful augmentation scheme that concatenates several contaminated versions of the same signal within the same training batch. The ECAPA-TDNN model turned out to provide robust speaker embeddings under both close-talking and distant-talking conditions. Our results on the popular AMI meeting corpus show that our system significantly outperforms recently proposed approaches.
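
As a rough illustration of the augmentation idea in the abstract above (several contaminated versions of the same signal placed in one training batch), here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which contaminations are used, so `add_noise` and `drop_chunk` are hypothetical stand-ins, and all function names, shapes, and parameters are assumptions made for the example.

```python
import numpy as np


def add_noise(x, snr_db=10.0, seed=0):
    """Hypothetical contamination: add Gaussian noise at a target SNR (dB)."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(x ** 2, axis=1, keepdims=True)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(size=x.shape) * np.sqrt(noise_power)


def drop_chunk(x, start=100, length=50):
    """Hypothetical contamination: zero out a short span of samples."""
    y = x.copy()
    y[:, start:start + length] = 0.0
    return y


def augment_batch(clean, contaminations):
    """Concatenate the clean batch with one contaminated copy per function.

    The copies are stacked along the batch axis, so a batch of B signals
    becomes a batch of B * (1 + len(contaminations)) signals.
    """
    versions = [clean] + [fn(clean) for fn in contaminations]
    return np.concatenate(versions, axis=0)


# Four one-second signals at an assumed 16 kHz sampling rate.
clean = np.random.default_rng(1).normal(size=(4, 16000))
labels = np.arange(4)

batch = augment_batch(clean, [add_noise, drop_chunk])
# Every copy keeps the speaker label of the signal it was derived from.
batch_labels = np.tile(labels, 3)
```

Because each contaminated copy shares its label with the clean original, the speaker labels are simply repeated once per version.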

arXiv.org e-Print Archive

Ghent University Academic Bibliography

Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

Author: Dawalatabad Nauman
Gimeno Pablo
Glass James
Khurana Sameer
Laurent Antoine
Mingote Victoria
Vicente Luis
Publication date: 01/06/2023

Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely, CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve an average gain of 18.8 and 11.9 average BLEU points on unseen medium and low-resource languages. We make similar observations on the Europarl speech translation benchmark.

arXiv.org e-Print Archive

Two-Pass IB based Speaker Diarization System using Meeting-Specific ANN based Features

Author: Dawalatabad Nauman
Madikeri Srikanth
Murthy Hema A
Sekhar C Chandra
Publication date: 19/12/2016

Infoscience - École polytechnique fédérale de Lausanne

Incremental Transfer Learning in Two-pass Information Bottleneck Based Speaker Diarization System for Meetings

Author: Dawalatabad Nauman
Madikeri Srikanth
Murthy Hema A
Sekhar C Chandra
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 18/02/2020

The two-pass information bottleneck (TPIB) based speaker diarization system operates independently on different conversational recordings. The TPIB system does not consider previously learned speaker-discriminative information while diarizing new conversations. Hence, the real-time factor (RTF) of the TPIB system is high owing to the training time required for the artificial neural network (ANN). This paper attempts to improve the RTF of the TPIB system using an incremental transfer learning approach where the parameters learned by the ANN from other conversations are updated using the current conversation rather than learning parameters from scratch. This reduces the RTF significantly. The effectiveness of the proposed approach compared to the baseline IB and the TPIB systems is demonstrated on standard NIST and AMI conversational meeting datasets. With a minor degradation in performance, the proposed system shows a significant improvement of 33.07% and 24.45% in RTF with respect to the TPIB system on the NIST RT-04Eval and AMI-1 datasets, respectively.
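
For readers unfamiliar with the metric, the sketch below shows how a real-time factor and a relative RTF improvement are commonly computed. RTF is processing time divided by audio duration. The abstract does not state how its percentages were derived, so treating "improvement" as the relative reduction in RTF is an assumption, and the timings used here are invented for illustration; they are not results from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_improvement_percent(baseline_rtf, new_rtf):
    """Relative reduction in RTF, in percent (assumed definition)."""
    if baseline_rtf <= 0:
        raise ValueError("baseline_rtf must be positive")
    return 100.0 * (baseline_rtf - new_rtf) / baseline_rtf


# Hypothetical timings for a 600-second recording.
baseline_rtf = real_time_factor(processing_seconds=900.0, audio_seconds=600.0)
new_rtf = real_time_factor(processing_seconds=600.0, audio_seconds=600.0)
improvement = relative_improvement_percent(baseline_rtf, new_rtf)
```

Under this definition a lower RTF is better, and an RTF below 1.0 means the system runs faster than real time.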

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Two-Pass IB based Speaker Diarization System using Meeting-Specific ANN based Features

Author: Dawalatabad Nauman
Madikeri Srikanth
Murthy Hema A
Sekhar C Chandra
Publication venue: Idiap
Publication date: 26/07/2018

Infoscience - École polytechnique fédérale de Lausanne

ECAPA-TDNN embeddings for speaker diarization

Author: Dawalatabad Nauman
Desplanques Brecht
Grondin François
Na Hwidong
Ravanelli Mirco
Thienpondt Jenthe
Publication venue: International Speech Communication Association
Publication date: 01/01/2021

Ghent University Academic Bibliography

SpeechBrain: A General-Purpose Speech Toolkit

Author: Aris William
Bengio Yoshua
Chou Ju-Chieh
Cornell Samuele
Dawalatabad Nauman
De Mori Renato
Fu Szu-Wei
Gao Yan
Grondin François
Heba Abdelwahab
Liao Chien-Feng
Lugosch Loren
Na Hwidong
Parcollet Titouan
Plantinga Peter
Rastorgueva Elena
Ravanelli Mirco
Rouhe Aku
Subakan Cem
Yeh Sung-Lin
Zhong Jianyuan
Publication venue: HAL CCSD
Publication date: 08/06/2021

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

arXiv.org e-Print Archive

Scientific Publications of the University of Toulouse II Le Mirail